Take-home Exercise 3

Author

Chang Fang Yu

Modified

March 25, 2025

Prototype of Shiny App Module 1: Machine Learning-Based Tender Classification

Load Data

Raw data

Rows: 18,638
Columns: 7
$ tender_no            <chr> "ACR000ETT18300010", "ACR000ETT18300011", "ACR000…
$ tender_description   <chr> "SUPPLY, DESIGN, DEVELOPMENT, CUSTOMIZATION, DELI…
$ agency               <chr> "Accounting And Corporate Regulatory Authority", …
$ award_date           <chr> "11/6/2019", "10/5/2019", "30/4/2019", "29/8/2019…
$ tender_detail_status <chr> "Awarded to Suppliers", "Awarded to No Suppliers"…
$ supplier_name        <chr> "AZAAS PTE. LTD.", "Unknown", "ACCENTURE SG SERVI…
$ awarded_amt          <dbl> 2305880.0, 0.0, 2035000.0, 30700373.9, 178800.0, …
[1] "tender_no"            "tender_description"   "agency"              
[4] "award_date"           "tender_detail_status" "supplier_name"       
[7] "awarded_amt"         

There are 18638 rows and 7 columns in the GP data set.

  tender_no         tender_description    agency           award_date       
 Length:18638       Length:18638       Length:18638       Length:18638      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 tender_detail_status supplier_name       awarded_amt       
 Length:18638         Length:18638       Min.   :0.000e+00  
 Class :character     Class :character   1st Qu.:7.000e+03  
 Mode  :character     Mode  :character   Median :1.647e+05  
                                         Mean   :5.540e+06  
                                         3rd Qu.:8.227e+05  
                                         Max.   :1.493e+09  

EDA

spc_tbl_ [18,638 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ tender_no           : chr [1:18638] "ACR000ETT18300010" "ACR000ETT18300011" "ACR000ETT19300001" "ACR000ETT19300002" ...
 $ tender_description  : chr [1:18638] "SUPPLY, DESIGN, DEVELOPMENT, CUSTOMIZATION, DELIVERY, INSTALLATION, TESTING, COMMISSIONING AND MAINTENANCE OF A"| __truncated__ "APPLICATION ENHANCEMENT, CUSTOMISATION, MIGRATION, DELIVERY, INSTALLATION, TESTING AND COMMISSIONING OF THE FUL"| __truncated__ "PROVISION OF CONSULTANCY SERVICES FOR STRATEGIC BUSINESS PROCESSES RE-ENGINEERING (SBPR) AND ACRA'S IT INFRASTRUCTURE" "SUPPLY, DELIVERY, DESIGN, CUSTOMISATION, INSTALLATION, CONFIGURATION, TESTING, COMMISSIONING OF A FULLY OPERATI"| __truncated__ ...
 $ agency              : chr [1:18638] "Accounting And Corporate Regulatory Authority" "Accounting And Corporate Regulatory Authority" "Accounting And Corporate Regulatory Authority" "Accounting And Corporate Regulatory Authority" ...
 $ award_date          : chr [1:18638] "11/6/2019" "10/5/2019" "30/4/2019" "29/8/2019" ...
 $ tender_detail_status: chr [1:18638] "Awarded to Suppliers" "Awarded to No Suppliers" "Awarded to Suppliers" "Awarded to Suppliers" ...
 $ supplier_name       : chr [1:18638] "AZAAS PTE. LTD." "Unknown" "ACCENTURE SG SERVICES PTE. LTD." "TECH MAHINDRA LIMITED (SINGAPORE BRANCH)" ...
 $ awarded_amt         : num [1:18638] 2305880 0 2035000 30700374 178800 ...
 - attr(*, "spec")=
  .. cols(
  ..   tender_no = col_character(),
  ..   tender_description = col_character(),
  ..   agency = col_character(),
  ..   award_date = col_character(),
  ..   tender_detail_status = col_character(),
  ..   supplier_name = col_character(),
  ..   awarded_amt = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
           tender_no   tender_description               agency 
                   0                    0                    0 
          award_date tender_detail_status        supplier_name 
                   0                    0                    0 
         awarded_amt 
                   0 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 7.000e+03 1.647e+05 5.540e+06 8.227e+05 1.493e+09 
[1] "Awarded to Suppliers"      "Awarded to No Suppliers"  
[3] "Awarded by Items"          "Award by interface record"

Data Cleaning

Drop the “Awarded to No Suppliers” data points. The cleaned data set, “Cleaned_GP”, has 17,855 rows:

# A tibble: 17,855 × 7
   tender_no         tender_description   agency award_date tender_detail_status
   <chr>             <chr>                <chr>  <chr>      <chr>               
 1 ACR000ETT18300010 SUPPLY, DESIGN, DEV… Accou… 11/6/2019  Awarded to Suppliers
 2 ACR000ETT19300001 PROVISION OF CONSUL… Accou… 30/4/2019  Awarded to Suppliers
 3 ACR000ETT19300002 SUPPLY, DELIVERY, D… Accou… 29/8/2019  Awarded to Suppliers
 4 ACR000ETT19300003 PROVISION OF MEDIA … Accou… 6/8/2019   Awarded to Suppliers
 5 ACR000ETT19300004 INVITATION TO TENDE… Accou… 5/11/2019  Awarded to Suppliers
 6 ACR000ETT20300002 INVITATION TO TENDE… Accou… 10/11/2020 Awarded by Items    
 7 ACR000ETT20300002 INVITATION TO TENDE… Accou… 10/11/2020 Awarded by Items    
 8 ACR000ETT20300003 PROVISION OF AN IT … Accou… 9/12/2020  Awarded to Suppliers
 9 ACR000ETT20300004 CONCEPTUALIZATION, … Accou… 9/3/2021   Awarded to Suppliers
10 ACR000ETT21000001 DESIGN, DEVELOPMENT… Accou… 6/9/2021   Awarded to Suppliers
# ℹ 17,845 more rows
# ℹ 2 more variables: supplier_name <chr>, awarded_amt <dbl>
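
The code behind this cleaning step is not shown. As a minimal sketch of the same filter, the Python snippet below drops rows by status on three toy records copied from the raw-data preview. The function name `drop_status` is a hypothetical helper, not part of the original analysis.

```python
# Toy records mirroring three columns of the raw data preview.
rows = [
    {"tender_no": "ACR000ETT18300010",
     "tender_detail_status": "Awarded to Suppliers",
     "awarded_amt": 2305880.0},
    {"tender_no": "ACR000ETT18300011",
     "tender_detail_status": "Awarded to No Suppliers",
     "awarded_amt": 0.0},
    {"tender_no": "ACR000ETT19300001",
     "tender_detail_status": "Awarded to Suppliers",
     "awarded_amt": 2035000.0},
]


def drop_status(records, status="Awarded to No Suppliers"):
    """Return the records whose tender_detail_status differs from `status`."""
    return [r for r in records if r["tender_detail_status"] != status]


cleaned_gp = drop_status(rows)
print(len(rows), "->", len(cleaned_gp))
```

Filtering on the status column, rather than on `awarded_amt == 0`, keeps any awarded tender that happens to carry a zero amount.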


Now let’s prototype Module 1: an LDA-based classification of tenders by tender_description.

Text Cleaning: Set Stop Words
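
The stop-word list used in the original analysis is not shown. The sketch below illustrates the general idea in Python: lowercase a description, split it into alphabetic tokens, and drop stop words. The `STOP_WORDS` set and the `clean_tokens` helper are assumptions made for illustration only.

```python
import re

# Hypothetical stop-word list; the list used in the original analysis is not shown.
STOP_WORDS = {"a", "and", "for", "of", "the", "to"}


def clean_tokens(description, stop_words=STOP_WORDS):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [t for t in tokens if t not in stop_words]


text = ("PROVISION OF CONSULTANCY SERVICES FOR STRATEGIC BUSINESS PROCESSES "
        "RE-ENGINEERING (SBPR) AND ACRA'S IT INFRASTRUCTURE")
tokens = clean_tokens(text)
print(tokens)
```

Note that this simple tokenizer splits on hyphens and apostrophes, so "RE-ENGINEERING" becomes two tokens. A real pipeline would likely also add domain-specific stop words.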

LDA Topic Modeling

Set the number of topics to k = 7.

Document-Topic Probability Matrix
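
For reference, the document-topic matrix produced by LDA can be written as follows. The symbols $\theta$, $D$, and $K$ are notation introduced here, not taken from the original code.

$$
\theta \in \mathbb{R}^{D \times K}, \qquad \theta_{d,k} = P(\text{topic } k \mid \text{document } d), \qquad \sum_{k=1}^{K} \theta_{d,k} = 1 \quad \text{for every document } d,
$$

with $K = 7$ topics here. Each row is therefore a probability distribution over the seven topics, which is why the per-document values in the table below add up to roughly 1 after rounding.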

K-means Clustering on Document-Topic Distributions

# A tibble: 6 × 9
  document            `1`     `2`     `3`    `4`     `5`     `6`     `7` cluster
  <chr>             <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <fct>  
1 ACR000ETT183000… 0.337  0.177   0.0738  0.164  0.232   0.00766 0.00766 3      
2 ACR000ETT193000… 0.128  0.0125  0.0126  0.218  0.0125  0.0125  0.604   3      
3 ACR000ETT193000… 0.210  0.0115  0.0115  0.0115 0.732   0.0115  0.0115  2      
4 ACR000ETT193000… 0.0231 0.0231  0.0231  0.358  0.317   0.233   0.0231  5      
5 ACR000ETT193000… 0.0847 0.00986 0.00987 0.712  0.00985 0.164   0.00985 5      
6 ACR000ETT203000… 0.224  0.00919 0.00919 0.730  0.00919 0.00919 0.00919 5      
[1] 11658     9
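
The clustering code is not shown, and neither is the number of clusters. As a rough sketch of the technique, the Python snippet below runs plain Lloyd's k-means on the six document-topic rows printed above. The value `k=3` and the random initialisation are assumptions, so the resulting labels will not necessarily match the `cluster` column.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Topic probabilities of the six documents shown in the table above.
doc_topic = [
    [0.337, 0.177, 0.0738, 0.164, 0.232, 0.00766, 0.00766],
    [0.128, 0.0125, 0.0126, 0.218, 0.0125, 0.0125, 0.604],
    [0.210, 0.0115, 0.0115, 0.0115, 0.732, 0.0115, 0.0115],
    [0.0231, 0.0231, 0.0231, 0.358, 0.317, 0.233, 0.0231],
    [0.0847, 0.00986, 0.00987, 0.712, 0.00985, 0.164, 0.00985],
    [0.224, 0.00919, 0.00919, 0.730, 0.00919, 0.00919, 0.00919],
]
labels = kmeans(doc_topic, k=3)  # k=3 is an illustrative choice
print(labels)
```

Because each row is a probability vector, all features share the same scale, so no standardisation is needed before computing Euclidean distances.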
           document cluster Topic Probability
1 ACR000ETT18300010       3     1  0.33709128
2 ACR000ETT19300001       3     1  0.12760314
3 ACR000ETT19300002       2     1  0.21029989
4 ACR000ETT19300003       5     1  0.02305609
5 ACR000ETT19300004       5     1  0.08471801
6 ACR000ETT20300002       5     1  0.22431128
[1] 81606     4
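
The two dimension outputs above are consistent with a wide-to-long reshape: 11,658 documents times 7 topic columns gives 81,606 rows. A minimal Python sketch of that reshape follows, using the first two documents from the table. The helper `to_long` is hypothetical.

```python
TOPIC_COLS = [str(t) for t in range(1, 8)]  # topic columns "1" .. "7"

# First two documents of the wide table (values as printed above).
wide = [
    {"document": "ACR000ETT18300010", "cluster": 3,
     "1": 0.337, "2": 0.177, "3": 0.0738, "4": 0.164,
     "5": 0.232, "6": 0.00766, "7": 0.00766},
    {"document": "ACR000ETT19300001", "cluster": 3,
     "1": 0.128, "2": 0.0125, "3": 0.0126, "4": 0.218,
     "5": 0.0125, "6": 0.0125, "7": 0.604},
]


def to_long(rows, topic_cols=TOPIC_COLS):
    """One output row per (topic, document) pair, ordered by topic first."""
    return [
        {"document": r["document"], "cluster": r["cluster"],
         "Topic": t, "Probability": r[t]}
        for t in topic_cols
        for r in rows
    ]


long_rows = to_long(wide)
print(len(wide), "wide rows ->", len(long_rows), "long rows")
print(11658 * 7)  # row count expected for the full data set
```

The long layout has one probability per row, which is the shape most plotting tools expect when drawing topic probabilities by cluster.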

Visualizing Topic Probabilities for Clusters

# A tibble: 70 × 3
# Groups:   topic [7]
   topic term           beta
   <int> <chr>         <dbl>
 1     1 software    0.0193 
 2     1 provision   0.0185 
 3     1 programme   0.0178 
 4     1 tender      0.0160 
 5     1 training    0.0151 
 6     1 support     0.0148 
 7     1 design      0.0127 
 8     1 insurance   0.0101 
 9     1 school      0.00991
10     1 development 0.00942
# ℹ 60 more rows
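
The table above has 70 rows, which matches the 10 highest-beta terms for each of the 7 topics. A minimal Python sketch of that per-topic selection is given below, run on five of the topic 1 rows in shuffled order. The helper `top_terms` is hypothetical.

```python
from collections import defaultdict


def top_terms(rows, n=10):
    """Return {topic: [(term, beta), ...]} with the n largest betas per topic."""
    by_topic = defaultdict(list)
    for topic, term, beta in rows:
        by_topic[topic].append((term, beta))
    return {
        topic: sorted(pairs, key=lambda p: p[1], reverse=True)[:n]
        for topic, pairs in by_topic.items()
    }


# (topic, term, beta) triples taken from the topic 1 rows above, shuffled.
beta_rows = [
    (1, "design", 0.0127),
    (1, "software", 0.0193),
    (1, "training", 0.0151),
    (1, "provision", 0.0185),
    (1, "programme", 0.0178),
]
top3 = top_terms(beta_rows, n=3)
print(top3)
```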